Fix SSP restoring in edge cases #104820

janvorli · 2024-07-12T18:37:12Z

There are edge cases in which restoring the SSP for continuation after a catch handler completes doesn't work correctly. The problem is that we scan the shadow stack for the Rip of the frame handling the exception to find where to restore the SSP, and in those edge cases the same address can appear there multiple times, with the first occurrence not being the right one. For example, this happens when an exception thrown from a catch handler escapes that handler and the handler for the escaped exception is in the same method as the one that invoked the first handler.

This change fixes it by finding the SSP of the first managed frame where we search for the handler and then updating the SSP with every unwind. The SSP is stored in the REGDISPLAY, so when we reach the CallCatchFunclet, the REGDISPLAY contains the SSP to restore.
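The difference between the two approaches can be illustrated with a toy model. The sketch below is not the runtime's code: the shadow stack is modeled as a vector of return addresses with the top at index 0, one slot per frame is an assumption, and `ShadowStack`, `FindByScan`, `FindByUnwind`, and the addresses are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: shadowStack[0] is the top (most recent return address).
using ShadowStack = std::vector<std::uint64_t>;

// Old approach (as described above): scan from the top for the Rip of the
// handling frame and take the first match. Returns size() if not found.
std::size_t FindByScan(const ShadowStack& shadowStack, std::uint64_t handlerRip)
{
    for (std::size_t i = 0; i < shadowStack.size(); ++i)
    {
        if (shadowStack[i] == handlerRip)
            return i;
    }
    return shadowStack.size();
}

// New approach (as described above): start at the SSP of the first managed
// frame and advance it by one slot for every frame that is unwound.
// Assumption of this model: every unwound frame occupies exactly one slot.
std::size_t FindByUnwind(const ShadowStack& shadowStack,
                         std::size_t firstManagedFrameSsp,
                         std::size_t framesUnwound)
{
    const std::size_t index = firstManagedFrameSsp + framesUnwound;
    assert(index < shadowStack.size()); // tracked SSP must stay on the stack
    return index;
}
```

With a shadow stack of `{0x30, 0x10, 0x20, 0x10, 0x40}` where the handling frame's return address `0x10` is the entry at index 3, `FindByScan` stops at index 1, while `FindByUnwind(stack, 0, 3)` lands on index 3.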

There was also one more issue with restoring the SSP. It turned out that the incsspq instruction uses only the lowest 8 bits of its argument to increment the SSP, so the ClrRestoreNonVolatileContextWorker needs a loop that repeats the instruction in case we need to move the SSP by more than 255 slots.
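To make the 8-bit limitation concrete, here is a small model of the chunked increment. It is only a sketch in C++, not the assembly from this change: `SimulatedIncsspq` and `AdvanceSsp` are hypothetical helpers, and the SSP is treated as a plain integer address with 8-byte slots.

```cpp
#include <cassert>
#include <cstdint>

// Models a single incsspq as described above: only the low 8 bits of the
// operand are used as the number of slots to skip.
std::uint64_t SimulatedIncsspq(std::uint64_t ssp, std::uint64_t operand)
{
    const std::uint64_t slots = operand & 0xFF;
    return ssp + slots * 8; // each shadow stack slot is 8 bytes on x64
}

// Advances the SSP by `slots` entries, at most 255 per step, so that no
// step loses bits to the 8-bit truncation.
std::uint64_t AdvanceSsp(std::uint64_t ssp, std::uint64_t slots)
{
    while (slots != 0)
    {
        const std::uint64_t step = slots < 255 ? slots : 255;
        assert(step <= 0xFF); // fits in the bits the instruction actually uses
        ssp = SimulatedIncsspq(ssp, step);
        slots -= step;
    }
    return ssp;
}
```

In this model a single `SimulatedIncsspq(ssp, 300)` moves only 44 slots (300 & 0xFF), while `AdvanceSsp(ssp, 300)` moves all 300 in two steps (255 + 45).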

VSadov · 2024-07-12T21:41:00Z

src/coreclr/vm/amd64/cgenamd64.cpp

+        // The float updating unwinds the stack so the pRD->pCurrentContext->Rip contains correct unwound Rip
+        // This is used for exception handling and the Rip extracted from m_pCallerReturnAddress is slightly
+        // off, which causes problem with searching for the return address on shadow stack on x64, so
+        // we keep the value from the unwind.


Was it a subtle bug that the Rip was off, or we just did not care about the correct IP in this case?

That's actually expected due to how the InlinedCallFrame is created in some cases. Neither the GC stack walk nor the new EH cared, but it caused a problem with the newly added search for the first managed frame, as that frame is sometimes the one that called the pinvoke (QCALL).

jkotas · 2024-07-12T21:44:00Z

Do we have test coverage for these corner cases?

janvorli · 2024-07-12T21:44:56Z

Do we have test coverage for these corner cases?

There is no way to really test those, as Windows would happily run and silently fixup things.

VSadov · 2024-07-12T22:30:19Z

Do we have test coverage for these corner cases?

There is no way to really test those, as Windows would happily run and silently fixup things.

Also testing the topmost SSP against what we can figure from SP is rarely helpful either. The top item is often correct, while we did not pop the things properly and the SSP keeps growing, for example.

There is some sensitivity to a wrong SSP in hijacking, as the OS unhijacks using the SSP and we use our own stashed value (but we assert that it works the same), so an incorrect SSP may lead to asserts. However, observing this requires a fairly tight race with suspension.
I wonder if running GC stress with CET enabled would trip on this more often. Although it would still be mostly about the top of the stack, which could be correct while the rest is not.

VSadov · 2024-07-12T22:34:29Z

Maybe there is a way to disable the OS fixup behavior, but then I would not be surprised if something beyond our control requires it and nothing would run in such a mode, even if it were possible to set up.

VSadov · 2024-07-12T22:46:50Z

src/coreclr/vm/amd64/Context.S

+    Update_Loop:
+        cmp     r11, rax
+        cmovb   rax, r11
+        incsspq rax


It is indeed documented that only bits 0-7 are used, regardless of operand size.

Perhaps there is a way to save a byte of assembly by doing INCSSPD eax :-)

The incsspd increments the SSP by multiples of 4, not by 8 :-)

I'd also make it use bits 0-3, because why not ... :-)

janvorli · 2024-07-12T22:47:57Z

Maybe there is a way to disable the OS fixup behavior

There is not, I've explicitly asked Windows folks about it.

VSadov

LGTM. Thanks!

jkotas · 2024-07-13T00:46:01Z

Is it possible for the shadow stack to overflow before this fix - can we use that to build a regression test? Or can the effects of this fix be observed as better performance?

If there is really no way to observe the effects of this fix, I am wondering why it is needed. If it is the case, can we simplify things by leaving it to the OS to fix things up as necessary?

VSadov · 2024-07-13T02:47:17Z

Is it possible for the shadow stack to overflow before this fix

Yes, if we pop the SSP to the first expected IP match but need to pop to some other match, an overflow could be the ultimate result after enough iterations. It could take many iterations, though, and a specially crafted repro.
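A back-of-the-envelope way to see why this takes many iterations: if every iteration leaves a few stale entries behind, the depth grows linearly until the shadow stack is exhausted. The sketch below is only that arithmetic; `IterationsUntilExhausted` is a hypothetical helper, and the real shadow stack capacity and the number of slots leaked per iteration depend on the OS and on the repro.

```cpp
#include <cassert>
#include <cstdint>

// Returns the number of iterations after which the leaked entries reach or
// exceed `capacitySlots`, when every iteration leaves
// `leakedSlotsPerIteration` stale entries on the shadow stack.
std::uint64_t IterationsUntilExhausted(std::uint64_t capacitySlots,
                                       std::uint64_t leakedSlotsPerIteration)
{
    assert(leakedSlotsPerIteration > 0); // without a leak there is no growth

    // Ceiling division: the last iteration may only partially fit.
    return (capacitySlots + leakedSlotsPerIteration - 1) / leakedSlotsPerIteration;
}
```

For example, with a purely illustrative capacity of 8192 slots and 2 slots leaked per iteration, the stack would be exhausted after 4096 iterations.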

jkotas · 2024-07-13T02:58:58Z

We should create the special repro. I would expect that it should be a viable outer loop test at least.

VSadov · 2024-07-13T03:04:51Z

It is possible that we already have tests where the SSP grows, but we do not see that since the tests do not run long enough.

I thought about asserting that the shadow stack is roughly the same depth as the regular one (i.e. not twice as deep) and using that check in some random places, but I am not sure how that can be done in practice, considering native frames on the stack.

janvorli · 2024-07-15T13:47:50Z

I will try to create a repro that would result in stack overflow without this change.

janvorli · 2024-07-16T14:55:26Z

@jkotas I have added a test that fails with (shadow) stack overflow before this fix and passes after.

src/tests/Regressions/coreclr/GitHub_104820/test104820.cs

jkotas · 2024-07-16T15:52:11Z

(shadow) stack overflow

Just curious - is there a special status code for shadow stack overflow or does it get reported as regular stack overflow?

janvorli · 2024-07-16T15:54:10Z

Just curious - is there a special status code for shadow stack overflow or does it get reported as regular stack overflow?

It gets reported as regular stack overflow.

jkotas

Thank you for creating the test!

VSadov · 2024-07-16T19:00:04Z

BTW, The test also demonstrates that the fixup performed by Windows is not always correct.

It is a good and perhaps the only possible backward-compat heuristic - pop to the nearest shadow stack item that matches the return site. However, the site may be recorded on the shadow stack multiple times and the code may be returning not to the lowermost occurrence.

janvorli · 2024-07-16T19:41:16Z

BTW, The test also demonstrates that the fixup performed by Windows is not always correct.

I think that it will eventually fix itself after a couple of returns - one return might pick a wrong location, but then a next return will be wrong again, so yet another fixup will be made that finally corrects things.

janvorli added the area-ExceptionHandling-coreclr label Jul 12, 2024

janvorli added this to the 9.0.0 milestone Jul 12, 2024

janvorli requested review from jkotas and VSadov July 12, 2024 18:37

janvorli self-assigned this Jul 12, 2024

VSadov reviewed Jul 12, 2024

This was referenced Jul 12, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

NuGet restore failing with NullReferenceException #103823

Closed

The job running on agent NetCore-Public ran longer than the maximum time #104044

Closed

Fix linux x64 build

565f9ab

VSadov reviewed Jul 12, 2024

VSadov approved these changes Jul 12, 2024

VSadov mentioned this pull request Jul 13, 2024

[NativeAOT] When reconciling shadow stack after catch, use more precise way to figure how much to pop. #104652

Merged

Jan Vorlicek added 2 commits July 16, 2024 10:17

Fix release build REGDISPLAY size constant

c54cf5d

Add regression test

a195d54

jkotas reviewed Jul 16, 2024

src/tests/Regressions/coreclr/GitHub_104820/test104820.cs (outdated, resolved)

Update src/tests/Regressions/coreclr/GitHub_104820/test104820.cs

e611fc4

jkotas approved these changes Jul 16, 2024

janvorli merged commit 4f38f92 into dotnet:main Jul 17, 2024
91 of 95 checks passed

github-actions bot locked and limited conversation to collaborators Aug 17, 2024
